ingest/ledgerbackend: add trusted hash to captive core catchup #5431
…for tx size limit test
@@ -566,6 +567,13 @@ func (i *Test) waitForCore() {
		if durationSince := time.Since(infoTime); durationSince < coreStartupPingInterval {
			time.Sleep(coreStartupPingInterval - durationSince)
		}
		cmdline := []string{"-f", integrationYaml, "logs", "core"}
Added this for visibility into core startup during test execution. Because the docker compose container output is detached, this helped identify a core dump that was happening in the container; the horizon test logs just repeat 'cannot connect to localhost:11626'.
		last = &boundedTo
	}

	c.lastLedger = last
This is the key to being able to (re)use the live `runFrom` with a bounded range: it triggers captive core to be stopped once it emits a ledger on the pipe that matches the bounded `to`, if one exists. That is the same outcome as the core catchup process terminating once the `to` ledger has been emitted.
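The stop-at-`to` behaviour described above could be sketched roughly as follows. This is only an illustration, not the code in this PR: `ledgerMeta`, `consumeUntil`, and the `stop` callback are hypothetical names, and a buffered channel stands in for the meta pipe.

```go
package main

import "fmt"

// ledgerMeta is a hypothetical stand-in for the metadata captive core
// writes to its pipe; only the sequence number matters for this sketch.
type ledgerMeta struct {
	Sequence uint32
}

// consumeUntil reads ledgers from stream and returns the sequences it saw.
// When last is non-nil, it calls stop and returns as soon as a ledger with
// sequence >= *last has been read, mimicking catchup exiting at `to`.
// When last is nil (unbounded range), it reads until the stream is closed.
func consumeUntil(stream <-chan ledgerMeta, last *uint32, stop func()) []uint32 {
	var seen []uint32
	for meta := range stream {
		seen = append(seen, meta.Sequence)
		if last != nil && meta.Sequence >= *last {
			stop()
			break
		}
	}
	return seen
}

func main() {
	// Simulate core emitting ledgers 100..107 on the pipe.
	stream := make(chan ledgerMeta, 8)
	for seq := uint32(100); seq < 108; seq++ {
		stream <- ledgerMeta{Sequence: seq}
	}
	close(stream)

	boundedTo := uint32(103)
	stopped := false
	seen := consumeUntil(stream, &boundedTo, func() { stopped = true })
	fmt.Println(seen, stopped) // prints "[100 101 102 103] true"
}
```

Passing a nil `last` models the unbounded case, where nothing triggers the stop and the reader keeps consuming.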
# Any range should do for basic testing, this range was chosen pretty early in history so that it only takes a few mins to run
docker run -e BRANCH=$(git rev-parse HEAD) -e FROM=10000063 -e TO=10000127 stellar/horizon-verify-range
# Use small default range of two most recent checkpoints back from latest archived checkpoint.
docker run -e TESTNET=true -e BRANCH=$(git rev-parse HEAD) -e FROM=0 -e TO=0 stellar/horizon-verify-range
Using the latest checkpoint on testnet for verify-range. I tried something similar on pubnet; it takes 45+ minutes, mostly in the change processor's ledger entry processing.
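For reference, a default range built from the two most recent checkpoints could be derived along these lines. This is a sketch under assumptions, not the logic of the verify-range script: it assumes the usual 64-ledger checkpoint frequency (checkpoint ledgers satisfy `(seq+1) % 64 == 0`), and `prevCheckpoint` and `defaultRange` are hypothetical helper names.

```go
package main

import "fmt"

// checkpointFrequency is assumed to be the standard 64-ledger interval.
const checkpointFrequency uint32 = 64

// prevCheckpoint returns the most recent checkpoint ledger at or before seq.
// ok is false when seq is earlier than the first checkpoint (ledger 63).
func prevCheckpoint(seq uint32) (checkpoint uint32, ok bool) {
	if seq < checkpointFrequency-1 {
		return 0, false
	}
	return (seq+1)/checkpointFrequency*checkpointFrequency - 1, true
}

// defaultRange returns the two most recent checkpoint ledgers at or before
// latest, as a (from, to) pair. ok is false if fewer than two exist.
func defaultRange(latest uint32) (from, to uint32, ok bool) {
	to, ok = prevCheckpoint(latest)
	if !ok || to < checkpointFrequency {
		return 0, 0, false
	}
	return to - checkpointFrequency, to, true
}

func main() {
	from, to, ok := defaultRange(10000130)
	fmt.Println(from, to, ok) // prints "10000063 10000127 true"
}
```

With a latest ledger of 10000130 this reproduces the pubnet range used in the example command, since 10000063 and 10000127 are adjacent checkpoints.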
@sreuland could you provide some data on how much slower
…md flag parsing as it's needed now since captive core does run mode
On mac/m1, with core v21.3.1, pubnet network:
I'm running older/larger ranges and will post results once complete.
@sreuland I think we need to take a step back and consider alternative implementations. The 26 minute overhead on a recent 64 ledger range is pretty significant. Assuming the overhead increases linearly as you go further back in time, this issue is likely to get worse when reingesting older ledger ranges. The overhead of using You mentioned

Another idea to consider is that we can verify the ledger chain at the end of a running reingestion task. After ingesting a bounded range with captive core, we can check the ledgers at the boundaries of that range in the horizon db. For example if we reingest the bounded range Also, when ingesting an unbounded range with captive core we would need to check that the start ledger's previous ledger hash matches whatever we have in the horizon db. However, one downside of this approach is that if there is a ledger mismatch we only find out at the very end.

It would be helpful to have a design document which outlines the different possible implementations with their pros / cons, to make it easier to analyze which is the best solution.
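The boundary check suggested above might look something like the sketch below. It is only an illustration of the idea: `ledgerHeader`, `checkBoundaries`, and the in-memory map standing in for the horizon db are all hypothetical and not part of horizon.

```go
package main

import (
	"errors"
	"fmt"
)

// ledgerHeader is a hypothetical, simplified view of a ledger row.
type ledgerHeader struct {
	Hash     string
	PrevHash string
}

var errChainMismatch = errors.New("ledger chain mismatch")

// checkBoundaries verifies that a freshly ingested range [from, to] links up
// with whatever neighbouring ledgers already exist in db (a map standing in
// for horizon db lookups, keyed by ledger sequence). Missing neighbours are
// skipped, since there is nothing to compare against.
func checkBoundaries(db map[uint32]ledgerHeader, from, to uint32) error {
	first, ok := db[from]
	if !ok {
		return fmt.Errorf("ledger %d not found", from)
	}
	last, ok := db[to]
	if !ok {
		return fmt.Errorf("ledger %d not found", to)
	}
	// Lower boundary: the start ledger's previous hash must match from-1.
	if prev, ok := db[from-1]; ok && prev.Hash != first.PrevHash {
		return fmt.Errorf("%w at lower boundary %d", errChainMismatch, from)
	}
	// Upper boundary: ledger to+1 must point back at the end ledger's hash.
	if next, ok := db[to+1]; ok && next.PrevHash != last.Hash {
		return fmt.Errorf("%w at upper boundary %d", errChainMismatch, to)
	}
	return nil
}

func main() {
	db := map[uint32]ledgerHeader{
		9:  {Hash: "h9", PrevHash: "h8"},
		10: {Hash: "h10", PrevHash: "h9"},
		11: {Hash: "h11", PrevHash: "h10"},
		12: {Hash: "h12", PrevHash: "tampered"},
	}
	fmt.Println(checkBoundaries(db, 10, 10)) // prints "<nil>"
	fmt.Println(checkBoundaries(db, 10, 11)) // prints "ledger chain mismatch at upper boundary 11"
}
```

As the comment notes, a check like this only reports a mismatch after the range has already been ingested.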
Yes, good suggestion. I've started a design doc to continue the discussion of options.
Ah, good callout. I think this could be a separate scope/ticket, so I created #5450 to add current-network validation via ledger hash to horizon's db as part of the forward 'live' ingestion mode. This ticket was focused on hash validation specific to reingestion of bounded offline ranges.
Closing this PR as obsolete in favor of a different design approach: repairing skip list usage in the captive core binary instead. Once that repair propagates, it will enable the existing 'catchup' to emit trusted hashes, anchored to the current network, within ledger metadata.
What
The Captive Core backend now performs an 'online' stellar-core `run` for bounded modes of tx-meta retrieval, which will be used for Horizon's `db reingest range` and `ingest verify-range` commands.

Using `run` enables core to build, validate, and emit trusted ledger hashes into the tx-meta stream for the requested ledger range. The bounded range commands will no longer use the 'offline' mode of running core `catchup` to generate tx-meta from history archives alone, which does not guarantee that the ledger hashes are verified against the live network, as mentioned in #4538.

Because `run` is used with LCL set to the `from` ledger, `reingest` and `verify-range` may now take longer to execute, since core has to perform online replay upfront from the network's latest ledger back to `from`. The additional run time grows with the age of the `from` ledger.

Why
Prevents consuming tx-meta from a potentially tampered history archives server.
Closes #4538
Known limitations
Execution will take longer for bounded ranges whose 'from' ledger sequence is old relative to the network's latest ledger sequence.
Other paths for obtaining trusted hashes were evaluated first, by enhancing the usage of the existing `catchup`. What was discovered is that using the command with trusted hash input flags leads to complex scenarios for obtaining the hash(es) in deployments, out-of-band from the horizon command (`reingest`, `verify-state`) invocation:

`--trusted-checkpoint-hashes` - to use this with `catchup`, you first need to run `verify-checkpoints` from core to build the file, which the first time takes just as long as `run` with LCL=1. After that, you need a persistent disk storage path to keep the file. This contradicts the `reingest` CLI expectations, which allow (re)running from any host/path and always create a new storage path, both for parallel purposes and to avoid conflicts in general between multiple captive instances on the same host. This doesn't scale across the range of use cases that `reingest` serves and would impinge on it.

`--trusted-hash` - to use this with `catchup`, you first need to get the live network's trusted hash for the intended 'to' ledger. This can only be done via `run` with LCL={to-1}, ingesting the 'to' ledger, and then stopping core. `run` with LCL={from-1} collects the meta for the range from the online 'internal catchup/replay' mode that core runs regardless, whereas this option would generate the meta for the range from the offline `catchup` command.